fix: prevent data loss from dirty_since race in push worker #19

Open
miyannishar wants to merge 1 commit into supermemoryai:main from miyannishar:fix/dirty-since-cas-race-condition
miyannishar commented May 15, 2026

Fix: Prevent data loss from dirty_since race in push worker

Fixes #18

The Problem

The push worker unconditionally sets dirty_since = None after wait_until_done() returns, which can take up to 30 seconds. If the user writes to the same file during that window:

  1. The new write sets dirty_since = T2
  2. The push worker finishes and calls set_dirty_since(ino, None) -> destroys T2
  3. The pull reconciler sees dirty_since = None -> overwrites local file with stale server content
  4. User's latest edits are silently lost
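
The four steps above can be replayed in a small in-memory model. This is only a sketch: FileState, ReconcileOutcome and replay_buggy_sequence are hypothetical names chosen for illustration, and the real reconcile_one in the project almost certainly has a different signature. It assumes the reconciler skips a file whenever dirty_since is set.

```rust
// Toy model of the race; types and signatures are illustrative assumptions,
// not the project's real API.
#[derive(Debug, PartialEq)]
enum ReconcileOutcome {
    SkippedDirty,
    Updated,
}

#[derive(Debug)]
struct FileState {
    dirty_since: Option<i64>,
    content: String,
}

// Pull reconciler: overwrite local content only when no local write is pending.
fn reconcile_one(file: &mut FileState, server_content: &str) -> ReconcileOutcome {
    if file.dirty_since.is_some() {
        return ReconcileOutcome::SkippedDirty;
    }
    file.content = server_content.to_string();
    ReconcileOutcome::Updated
}

// Replays the buggy sequence and returns the final local content.
fn replay_buggy_sequence() -> String {
    // The push worker has uploaded version A (dirty since T1) and is waiting.
    let mut file = FileState {
        dirty_since: Some(1_000),
        content: "version A".to_string(),
    };

    // Step 1: the user writes version B, so dirty_since becomes T2.
    file.content = "version B".to_string();
    file.dirty_since = Some(2_000);

    // Step 2: the worker finishes and clears the flag unconditionally.
    file.dirty_since = None;

    // Step 3: the reconciler sees no dirty flag and applies stale server content.
    let outcome = reconcile_one(&mut file, "version A");
    assert_eq!(outcome, ReconcileOutcome::Updated);

    // Step 4: version B is gone.
    file.content
}
```

In this model the only thing protecting version B is the dirty flag, which is why clearing it without checking its value is enough to lose the edit.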

The Fix

Compare-and-swap instead of unconditional clear:

  • Db::clear_dirty_since_if_unchanged(ino, expected_ms) - New CAS method that only clears dirty_since if it still matches the value captured at job claim time. If a concurrent write updated it, the CAS fails and the protection flag is preserved.

  • PushJob::dirty_since_at_claim - New field that snapshots dirty_since when the job is claimed from the queue, providing the "expected" value for the CAS.

  • wait_until_done return value - Previously ignored. The worker now stamps last_status = "done" only when the server actually confirmed completion; on timeout, it defers to the inflight poller.
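
A minimal sketch of the compare-and-swap idea, using a HashMap in place of the real database. The method and field names (clear_dirty_since_if_unchanged, dirty_since_at_claim, set_dirty_since) come from this PR, but the parameter types, the bool return value, and the claim_job helper are assumptions made for illustration. In a SQL-backed store the same check would presumably be a single conditional UPDATE so that the comparison and the clear happen atomically.

```rust
use std::collections::HashMap;

// In-memory stand-in for the real Db (assumption: keyed by inode number).
struct Db {
    dirty_since: HashMap<u64, Option<i64>>,
}

impl Db {
    fn new() -> Self {
        Db { dirty_since: HashMap::new() }
    }

    fn set_dirty_since(&mut self, ino: u64, value: Option<i64>) {
        self.dirty_since.insert(ino, value);
    }

    fn dirty_since(&self, ino: u64) -> Option<i64> {
        self.dirty_since.get(&ino).copied().flatten()
    }

    // Clears dirty_since only if it still equals the value seen at claim time.
    // Returns true when the flag was cleared, false when a concurrent write
    // changed it (or the inode is unknown) and the flag was left alone.
    fn clear_dirty_since_if_unchanged(&mut self, ino: u64, expected_ms: Option<i64>) -> bool {
        match self.dirty_since.get_mut(&ino) {
            Some(current) if *current == expected_ms => {
                *current = None;
                true
            }
            _ => false,
        }
    }
}

// Snapshot taken when the job is claimed from the queue.
struct PushJob {
    ino: u64,
    dirty_since_at_claim: Option<i64>,
}

// Hypothetical helper: claim a job and record the current dirty_since.
fn claim_job(db: &Db, ino: u64) -> PushJob {
    PushJob { ino, dirty_since_at_claim: db.dirty_since(ino) }
}
```

With this shape, a worker that claimed the job at T1 cannot clear a flag that a later write moved to T2, so the reconciler still sees the file as dirty.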

Test Results

test result: ok. 160 passed; 0 failed; 1 ignored

All 160 tests pass, including:

  • test_dirty_since_cas_prevents_data_loss - Simulates the full race sequence and verifies reconcile_one returns SkippedDirty (not Updated)
  • test_dirty_since_cas_clears_when_unchanged - Verifies the CAS clears normally when no concurrent write occurs

The push worker unconditionally cleared dirty_since after wait_until_done,
creating a 30-second window where concurrent writes could be silently lost:

1. Push worker sends version A, enters wait_until_done (up to 30s)
2. User writes version B → dirty_since updated to T2
3. wait_until_done returns → set_dirty_since(ino, None) destroys T2
4. Pull reconciler sees dirty_since=None → overwrites with stale version A
5. User's version B is silently lost

Fix: Replace unconditional set_dirty_since(None) with a compare-and-swap
that captures dirty_since at job claim time and only clears it if unchanged.
If a concurrent write updated dirty_since during the wait window, the CAS
refuses to clear it, and the pull reconciler correctly skips the overwrite.

Also fixes: wait_until_done return value was being ignored — the code
stamped last_status='done' even on timeout. Now only stamps 'done' when
the server actually confirmed completion.
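
As a rough illustration of the status handling, assuming wait_until_done reports completion as a bool (its real return type is not shown here) and that a timeout leaves last_status untouched for the inflight poller to resolve, the decision could look like this. The function name status_after_wait is hypothetical.

```rust
// Sketch only: decides what last_status should be after the wait returns.
// `server_confirmed` stands in for whatever wait_until_done actually returns.
fn status_after_wait(previous_status: &str, server_confirmed: bool) -> String {
    if server_confirmed {
        // The server confirmed completion, so it is safe to stamp "done".
        "done".to_string()
    } else {
        // Timed out: keep the existing status and let the inflight poller decide.
        previous_status.to_string()
    }
}
```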

Changes:
- Add Db::clear_dirty_since_if_unchanged() (CAS method)
- Add dirty_since_at_claim field to PushJob (snapshot at claim time)
- Fix all 3 call sites in push.rs (PATCH, POST, binary upload)
- Handle wait_until_done timeout (defer to inflight poller)
- Add regression test proving the fix prevents data loss
- Add happy-path test proving CAS clears normally when unchanged
miyannishar force-pushed the fix/dirty-since-cas-race-condition branch from fd4a633 to 92b4503 on May 15, 2026, 10:38
Development

Successfully merging this pull request may close these issues.

Data loss: push worker unconditionally clears dirty_since after wait_until_done, allowing pull reconciler to overwrite concurrent writes